Week 13: Memory Networks

0. Introduction

The code for this lab is mainly adapted from open-source code on GitHub; please click here for the original. If you have any questions about this handout, feel free to message me on WeChat or email cuizhiying.csu@gmail.com.

0.1 Experimental content and requirements

In this lab we mainly implement and run Dynamic Memory Networks for Visual and Textual Question Answering (2016), using The (20) QA bAbI tasks from the bAbI dataset. The specific requirements are as follows:

  1. Get a feel for the basic structure of the Memory Network: read the lab code and, together with the lecture material, deepen your thinking about and understanding of Memory Networks
  2. Answer the questions raised in this lab manual on your own (brief answers are enough)
  3. Following the guidance of the manual, fill in the missing pieces of code so that the program runs properly
  4. Work independently; plagiarism is forbidden
  5. After finishing, download the whole folder (make sure the program outputs are preserved), compress it (zip format only) and upload it to the supercomputing course website.

Today's task is a typical QA (question answering) task, something we have not touched on before, so before starting the lab you are strongly encouraged to read the related papers outside of class. Having a clear picture of the model architecture first helps a lot with understanding the code; of course, reading the code and the papers side by side is also a fine option.

The following four papers form a clear line of development for Memory Networks and are worth consulting. The code in this lab implements the fourth one.

  1. Memory Networks (2015)
  2. End-To-End Memory Networks (2015)
  3. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing (2016)
  4. Dynamic Memory Networks for Visual and Textual Question Answering (2016)

If your understanding of the network is still not thorough and you do not want to spend much time on the papers, you are strongly encouraged to read this paper summary. It is very clearly organized and will help you understand and complete this lab much more effectively.

1. Dataset Exploration

1.1 Intuition

This lab uses The (20) QA bAbI tasks from the bAbI dataset. The data are stored as txt files in ./data/en-10k/; it is a good idea to open a file first to get an intuitive feel for it.

The official description is quoted below:

This section presents the first set of 20 tasks for testing text understanding and reasoning in the bAbI project.
The aim is that each task tests a unique aspect of text and reasoning, and hence test different capabilities of learning models.

The file format for each task is as follows:

ID text
ID text
ID text
ID question[tab]answer[tab]supporting fact IDS.
...

The IDs for a given “story” start at 1 and increase. When the IDs in a file reset back to 1 you can consider the following sentences as a new “story”. Supporting fact IDs only ever reference the sentences within a “story”.

For Example:

1 Mary moved to the bathroom.
2 John went to the hallway.
3 Where is Mary?        bathroom        1
4 Daniel went back to the hallway.
5 Sandra moved to the garden.
6 Where is Daniel?      hallway 4
7 John moved to the office.
8 Sandra journeyed to the bathroom.
9 Where is Daniel?      hallway 4
10 Mary moved to the hallway.
11 Daniel travelled to the office.
12 Where is Daniel?     office  11
13 John went back to the garden.
14 John moved to the bedroom.
15 Where is Sandra?     bathroom        8
1 Sandra travelled to the office.
2 Sandra went to the bathroom.
3 Where is Sandra?      bathroom        2

This is also what the dataset should look like when you open the files yourself.

1.2 Dataset Preparation

Handling data in this format is quite different from our previous tasks and considerably more tedious. Preprocessing is of course important, but it is not the focus of this lab, so we provide the data directly as a preprocessed Dataset / DataLoader for you to use.

All we need to know is that, in this lab, preprocessing encodes the text above into word indices and then pads the variable-length passages (roughly speaking, by appending zeros) up to the length of the longest one, so that they can be fed in as batches. If this is not clear yet, don't worry; the tests below will make it concrete. (If you are interested, you can read the source code in ./babi_loader.py.)
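
To make the file format above concrete, here is a rough sketch of the kind of parsing a loader such as ./babi_loader.py has to perform. It is not the actual implementation, and the function name parse_babi is made up for illustration; it only uses the ID-reset rule and the tab-separated question lines described in the official format.

def parse_babi(lines):
    # Sketch only: split raw bAbI lines into (story, question, answer) triples.
    data, story = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        sid, text = line.split(' ', 1)
        if int(sid) == 1:            # IDs reset to 1 -> a new story starts
            story = []
        if '\t' in text:             # question line: question \t answer \t supporting fact IDs
            question, answer, _support = text.split('\t')
            data.append((list(story), question.strip(), answer.strip()))
        else:                        # plain statement: append to the running story
            story.append(text)
    return data

The real loader additionally builds the vocabulary, converts words to indices and handles the train/test split, but the triple structure it produces is the same as what we inspect below.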


In [1]:
from babi_loader import BabiDataset, pad_collate

In [2]:
from torch.utils.data import DataLoader
# There are 20 tasks; choose which one to load here
task_id = 1
dataset = BabiDataset( task_id )
dataloader = DataLoader(dataset, batch_size=4, shuffle=False, collate_fn=pad_collate)

Now that the dataloader is set up, let's print one batch of data to see what it looks like.


In [3]:
contexts, questions, answers = next(iter(dataloader))
print(contexts.shape)
print(questions.shape)
print(answers.shape)
print(contexts)
print(questions)
print(answers)


torch.Size([4, 8, 8])
torch.Size([4, 4])
torch.Size([4])
tensor([[[ 2.,  3.,  4.,  5.,  6.,  7.,  1.,  0.],
         [ 8.,  9.,  4.,  5., 10.,  7.,  1.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]],

        [[ 2.,  3.,  4.,  5.,  6.,  7.,  1.,  0.],
         [ 8.,  9.,  4.,  5., 10.,  7.,  1.,  0.],
         [13.,  9., 14.,  4.,  5., 10.,  7.,  1.],
         [15.,  3.,  4.,  5., 16.,  7.,  1.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]],

        [[ 2.,  3.,  4.,  5.,  6.,  7.,  1.,  0.],
         [ 8.,  9.,  4.,  5., 10.,  7.,  1.,  0.],
         [13.,  9., 14.,  4.,  5., 10.,  7.,  1.],
         [15.,  3.,  4.,  5., 16.,  7.,  1.,  0.],
         [ 8.,  3.,  4.,  5., 17.,  7.,  1.,  0.],
         [15., 18.,  4.,  5.,  6.,  7.,  1.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]],

        [[ 2.,  3.,  4.,  5.,  6.,  7.,  1.,  0.],
         [ 8.,  9.,  4.,  5., 10.,  7.,  1.,  0.],
         [13.,  9., 14.,  4.,  5., 10.,  7.,  1.],
         [15.,  3.,  4.,  5., 16.,  7.,  1.,  0.],
         [ 8.,  3.,  4.,  5., 17.,  7.,  1.,  0.],
         [15., 18.,  4.,  5.,  6.,  7.,  1.,  0.],
         [ 2.,  3.,  4.,  5., 10.,  7.,  1.,  0.],
         [13., 19.,  4.,  5., 17.,  7.,  1.,  0.]]], dtype=torch.float64)
tensor([[11, 12,  2,  1],
        [11, 12, 13,  1],
        [11, 12, 13,  1],
        [11, 12, 13,  1]])
tensor([ 6, 10, 10, 17])

As you can see, what gets printed are numeric indices rather than the actual text, so before producing output we need to map these indices back to recover the original words. Let's look at the lookup table first.


In [4]:
references = dataset.QA.IVOCAB
print(references)


{0: '<PAD>', 1: '<EOS>', 2: 'mary', 3: 'moved', 4: 'to', 5: 'the', 6: 'bathroom', 7: '.', 8: 'john', 9: 'went', 10: 'hallway', 11: 'where', 12: 'is', 13: 'daniel', 14: 'back', 15: 'sandra', 16: 'garden', 17: 'office', 18: 'journeyed', 19: 'travelled', 20: 'bedroom', 21: 'kitchen'}

With this lookup table we can easily restore our data: we just map each index back to its word.


In [5]:
def interpret_indexed_tensor(contexts, questions, answers):
    for n, data in enumerate(zip(contexts, questions, answers)):
        context  = data[0]
        question = data[1]
        answer   = data[2]
        
        print(n)
        for i, sentence in enumerate(context):
            s = ' '.join([references[elem.item()] for elem in sentence])
            print(f'{i}th sentence: {s}')
        s = ' '.join([references[elem.item()] for elem in question])
        print(f'question: {s}')
        s = references[answer.item()]
        print(f'answer:   {s}')

In [6]:
interpret_indexed_tensor(contexts, questions, answers)


0
0th sentence: mary moved to the bathroom . <EOS> <PAD>
1th sentence: john went to the hallway . <EOS> <PAD>
2th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
3th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
4th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
5th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
6th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
7th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
question: where is mary <EOS>
answer:   bathroom
1
0th sentence: mary moved to the bathroom . <EOS> <PAD>
1th sentence: john went to the hallway . <EOS> <PAD>
2th sentence: daniel went back to the hallway . <EOS>
3th sentence: sandra moved to the garden . <EOS> <PAD>
4th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
5th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
6th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
7th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
question: where is daniel <EOS>
answer:   hallway
2
0th sentence: mary moved to the bathroom . <EOS> <PAD>
1th sentence: john went to the hallway . <EOS> <PAD>
2th sentence: daniel went back to the hallway . <EOS>
3th sentence: sandra moved to the garden . <EOS> <PAD>
4th sentence: john moved to the office . <EOS> <PAD>
5th sentence: sandra journeyed to the bathroom . <EOS> <PAD>
6th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
7th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
question: where is daniel <EOS>
answer:   hallway
3
0th sentence: mary moved to the bathroom . <EOS> <PAD>
1th sentence: john went to the hallway . <EOS> <PAD>
2th sentence: daniel went back to the hallway . <EOS>
3th sentence: sandra moved to the garden . <EOS> <PAD>
4th sentence: john moved to the office . <EOS> <PAD>
5th sentence: sandra journeyed to the bathroom . <EOS> <PAD>
6th sentence: mary moved to the hallway . <EOS> <PAD>
7th sentence: daniel travelled to the office . <EOS> <PAD>
question: where is daniel <EOS>
answer:   office

From the test above, it is now clear how to use this DataLoader.

At the same time, by looking at the inputs and outputs above, we should notice the following:

  1. The <PAD> symbol: within a batch, every sentence is padded to the length of the longest sentence. The same goes for stories: with a batch size of 8 there are 8 questions in the batch, but some stories have more sentences than others, so every story is padded to the maximum number of sentences. With an RNN we would not strictly need sentences and passages of equal length; fixed-length padding is used here because only fixed-length inputs can be processed as a batch, and it is a very common technique in text processing (a minimal padding sketch follows this list)
  2. The <EOS> symbol marks the end of a sentence
  3. The same sentence can be referenced by several questions
  4. The predicted answer is one of the words contained in the input sentences.
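
As a minimal illustration of the padding idea (plain Python lists of word indices; this is not the real pad_collate, and pad_idx=0 simply matches the <PAD> index we saw above):

def pad_batch(sentences, pad_idx=0):
    # Pad every index sequence in the batch to the length of the longest one.
    max_len = max(len(s) for s in sentences)
    return [s + [pad_idx] * (max_len - len(s)) for s in sentences]

pad_batch([[2, 3, 4, 1], [2, 3, 1]])   # -> [[2, 3, 4, 1], [2, 3, 1, 0]]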

2. Dynamic Memory Networks

Before trying to understand this network architecture, be sure to read the paper summary mentioned above carefully. It is very detailed, and I will not repeat its fine-grained comparisons here. It is strongly recommended that you first review how the papers above evolved, so that you have a clear picture of how the network operates; to revisit the summary, click here.

2.1 Architecture

The overall architecture of the network is shown in the figure below.

From the figure above we can see that building this QA system requires four modules: the Input Module, Memory Module, Question Module and Answer Module. We will go through the details in the implementation below.

2.2 Input Module

First, import some commonly used packages.


In [7]:
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.init as init
from torch.autograd import Variable

The DMN uses GRUs as encoders in many places, so let's briefly review them here. In the general case, the GRU update is given by the following equations:
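
Since the original figure with the equations may not display here, one common formulation of the GRU update (following the notation of the DMN papers, with $\circ$ denoting element-wise multiplication) is:

$$ u_t = \sigma\left(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)}\right) $$
$$ r_t = \sigma\left(W^{(r)} x_t + U^{(r)} h_{t-1} + b^{(r)}\right) $$
$$ \tilde{h}_t = \tanh\left(W x_t + r_t \circ U h_{t-1} + b^{(h)}\right) $$
$$ h_t = u_t \circ \tilde{h}_t + \left(1 - u_t\right) \circ h_{t-1} $$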

In general, you can simply think of the GRU as a simplified version of the LSTM.

First we need to encode the input sentences, here with a bi-directional GRU. Note also that the forward function takes a word_embedding parameter: what is actually passed in is an embedding module such as nn.Embedding. Because the code is split into separate modules, this parameter is passed around so that all modules share the same word embedding. The whole process is illustrated in the figure below:

The bi-directional GRU is used here to better capture context from both directions. If the GRU module is unclear to you, click here for a quick review. The bi-directional GRU update used here is given by the following equations:
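
In case the referenced figure is missing: the fusion used below (equation (6) referenced in the code comments, from the DMN+ paper) runs a GRU over the position-encoded sentence representations $\hat{f}_i$ in both directions and sums the two hidden states,

$$ \overrightarrow{f}_i = \mathrm{GRU}_{\text{fwd}}\left(\hat{f}_i, \overrightarrow{f}_{i-1}\right), \qquad \overleftarrow{f}_i = \mathrm{GRU}_{\text{bwd}}\left(\hat{f}_i, \overleftarrow{f}_{i+1}\right), \qquad f_i = \overrightarrow{f}_i + \overleftarrow{f}_i $$

which is exactly what the line facts[:, :, :self.hidden_size] + facts[:, :, self.hidden_size:] computes after the bidirectional nn.GRU call.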


In [8]:
class InputModule(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super(InputModule, self).__init__()
        self.hidden_size = hidden_size
        
        ##################################################################################
        #
        # nn.GRU use the bidirectional parameter to control the type of GRU
        #
        ##################################################################################
        self.gru = nn.GRU(hidden_size, hidden_size, bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(0.1)

    # word_embedding is passed in so that all modules share the same embedding layer
    def forward(self, contexts, word_embedding):
        '''
        contexts.size() -> (#batch, #sentence, #token)
        word_embedding() -> (#batch, #sentence x #token, #embedding)
        position_encoding() -> (#batch, #sentence, #embedding)
        facts.size() -> (#batch, #sentence, #hidden = #embedding)
        '''
        batch_num, sen_num, token_num = contexts.size()

        contexts = contexts.view(batch_num, -1)
        contexts = word_embedding(contexts)

        contexts = contexts.view(batch_num, sen_num, token_num, -1)
        contexts = self.position_encoding(contexts)
        contexts = self.dropout(contexts)

        #########################################################################
        #
        # if you change the GRU type, you should also change the initial hidden state:
        # a bidirectional GRU needs shape (2, *, *), while a normal GRU needs (1, *, *)
        #
        #########################################################################
        h0 = Variable(torch.zeros(2, batch_num, self.hidden_size).cuda())
        facts, hdn = self.gru(contexts, h0)
        #########################################################################
        #
        # if you use a bi-directional GRU, you should fuse the two directional outputs
        # according to equation (6); if you use a normal GRU,
        # comment out the following line.
        #
        #########################################################################
        facts = facts[:, :, :self.hidden_size] + facts[:, :, self.hidden_size:]
        return facts
    
    def position_encoding(self, embedded_sentence):
        '''
        embedded_sentence.size() -> (#batch, #sentence, #token, #embedding)
        l.size() -> (#sentence, #embedding)
        output.size() -> (#batch, #sentence, #embedding)
        '''
        _, _, slen, elen = embedded_sentence.size()

        l = [[(1 - s/(slen-1)) - (e/(elen-1)) * (1 - 2*s/(slen-1)) for e in range(elen)] for s in range(slen)]
        l = torch.FloatTensor(l)
        l = l.unsqueeze(0) # for #batch
        l = l.unsqueeze(1) # for #sen
        l = l.expand_as(embedded_sentence)
        weighted = embedded_sentence * Variable(l.cuda())
        return torch.sum(weighted, dim=2).squeeze(2) # sum with tokens

In the module above you may have noticed the self.position_encoding() function. Its job is to add positional information to the input sentence: position_encoding turns an input sentence such as {I, love, you} into something like {I_1, love_2, you_3} before passing it on. The reasoning behind this is explained in the quotation below.

A bag-of-words model is inherently unordered: the sentences "I love you" and "you love me" are both {I, love, you} in BOW, so the model cannot distinguish their different meanings. If each word is given a position encoding, however, they become {I_1, love_2, you_3} and {I_3, love_2, you_1}, which are different data; position encoding is therefore simply a kind of feature.

Author: HideOnBooks
Link: https://www.zhihu.com/question/56476625/answer/416928811
Source: Zhihu
Copyright belongs to the author. For commercial reuse please contact the author for permission; for non-commercial reuse please credit the source.
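
For reference, the weighting that position_encoding implements is essentially the positional scheme from the DMN+ paper: with $M$ tokens per sentence, embedding dimension $D$, and $w^i_j$ the embedding of the $j$-th word of sentence $i$,

$$ l_{jd} = \left(1 - \frac{j}{M}\right) - \frac{d}{D}\left(1 - \frac{2j}{M}\right), \qquad f_i = \sum_{j=1}^{M} l_j \circ w^i_j $$

(the code uses zero-based indices and normalizes by $M-1$ and $D-1$, a small variation of the same idea). Each sentence is thus reduced to a single order-aware vector $f_i$.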

2.3 Question Module

The Question Module simply encodes the question text (a single sentence, e.g. "where are you?") with an ordinary GRU and then passes the encoding on to the Memory Module for the attention operation. The code is straightforward, so we will not dwell on it.


In [9]:
class QuestionModule(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super(QuestionModule, self).__init__()
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, questions, word_embedding):
        '''
        questions.size() -> (#batch, #token)
        word_embedding() -> (#batch, #token, #embedding)
        gru() -> (1, #batch, #hidden)
        '''
        questions = word_embedding(questions)
        _, questions = self.gru(questions)
        questions = questions.transpose(0, 1)
        return questions

2.4 Memory Module

A diagram of the Memory module is shown below:

Here we take the information $F$ encoded by the Input Module and feed it into the Attention Mechanism, which iteratively retrieves the relevant information. The Attention Mechanism also implicitly receives the GRU-encoded question text produced by the Question Module:
the attention is essentially comparing the Question with the Input and looking for the connections between them.
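
Concretely, the gate computed in make_interaction below follows the DMN+ attention: for each fact $f_i$, question $q$ and previous memory $m^{t-1}$,

$$ z^t_i = \left[\, f_i \circ q;\; f_i \circ m^{t-1};\; |f_i - q|;\; |f_i - m^{t-1}| \,\right] $$
$$ g^t_i = \mathrm{softmax}_i\!\left( W^{(2)} \tanh\!\left( W^{(1)} z^t_i + b^{(1)} \right) + b^{(2)} \right) $$

so $g^t_i$ is a scalar attention weight over the sentences of the story.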


In [10]:
class AttentionGRUCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(AttentionGRUCell, self).__init__()
        self.hidden_size = hidden_size
        
        self.Wr = nn.Linear(input_size, hidden_size)
        self.Ur = nn.Linear(hidden_size, hidden_size)
        self.W = nn.Linear(input_size, hidden_size)
        self.U = nn.Linear(hidden_size, hidden_size)

    def forward(self, fact, C, g):
        r = torch.sigmoid(self.Wr(fact) + self.Ur(C))
        h_tilda = torch.tanh(self.W(fact) + r * self.U(C))
        g = g.unsqueeze(1).expand_as(h_tilda)
        h = g * h_tilda + (1 - g) * C
        return h
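
The cell above replaces the usual GRU update gate with the attention gate $g^t_i$ (this is the Attention GRU of the DMN+ paper; the biases live inside the nn.Linear layers):

$$ r_i = \sigma\left( W^{(r)} f_i + U^{(r)} h_{i-1} \right) $$
$$ \tilde{h}_i = \tanh\left( W f_i + r_i \circ U h_{i-1} \right) $$
$$ h_i = g^t_i \circ \tilde{h}_i + \left(1 - g^t_i\right) \circ h_{i-1} $$

When $g^t_i$ is close to 0, the corresponding fact is effectively skipped and the previous hidden state is carried through unchanged.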

In [11]:
class AttentionGRU(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(AttentionGRU, self).__init__()
        self.hidden_size = hidden_size
        self.AGRUCell = AttentionGRUCell(input_size, hidden_size)

    def forward(self, facts, G):
        batch_num, sen_num, embedding_size = facts.size()
        C = Variable(torch.zeros(self.hidden_size)).cuda()
        for sid in range(sen_num):
            fact = facts[:, sid, :]
            g = G[:, sid]
            if sid == 0:
                C = C.unsqueeze(0).expand_as(fact)
            C = self.AGRUCell(fact, C, g)
        return C


In [12]:
class EpisodicMemory(nn.Module):
    def __init__(self, hidden_size):
        super(EpisodicMemory, self).__init__()
        self.AGRU = AttentionGRU(hidden_size, hidden_size)
        self.z1 = nn.Linear(4 * hidden_size, hidden_size)
        self.z2 = nn.Linear(hidden_size, 1)
        self.next_mem = nn.Linear(3 * hidden_size, hidden_size)

    def make_interaction(self, facts, questions, prevM):
        '''
        facts.size() -> (#batch, #sentence, #hidden = #embedding)
        questions.size() -> (#batch, 1, #hidden)
        prevM.size() -> (#batch, #sentence = 1, #hidden = #embedding)
        z.size() -> (#batch, #sentence, 4 x #embedding)
        G.size() -> (#batch, #sentence)
        '''
        batch_num, sen_num, embedding_size = facts.size()
        questions = questions.expand_as(facts)
        prevM = prevM.expand_as(facts)

        z = torch.cat([
            facts * questions,
            facts * prevM,
            torch.abs(facts - questions),
            torch.abs(facts - prevM)
        ], dim=2)

        z = z.view(-1, 4 * embedding_size)

        G = torch.tanh(self.z1(z))
        G = self.z2(G)
        G = G.view(batch_num, -1)
        G = F.softmax(G, dim=1)

        return G
    
    def forward(self, facts, questions, prevM):
        '''
        facts.size() -> (#batch, #sentence, #hidden = #embedding)
        questions.size() -> (#batch, #sentence = 1, #hidden)
        prevM.size() -> (#batch, #sentence = 1, #hidden = #embedding)
        G.size() -> (#batch, #sentence)
        C.size() -> (#batch, #hidden)
        concat.size() -> (#batch, 3 x #embedding)
        '''
        G = self.make_interaction(facts, questions, prevM)
        C = self.AGRU(facts, G)
        concat = torch.cat([prevM.squeeze(1), C, questions.squeeze(1)], dim=1)
        next_mem = F.relu(self.next_mem(concat))
        next_mem = next_mem.unsqueeze(1)
        return next_mem
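
The forward pass above corresponds to the ReLU memory update of the DMN+ paper: the attention gates $G$ drive the Attention GRU to produce a context vector $c^t$, and the new memory is

$$ m^t = \mathrm{ReLU}\left( W^{(t)} \left[ m^{t-1};\; c^t;\; q \right] + b \right) $$

which is exactly what self.next_mem computes on the concatenation of prevM, C and questions.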

2.5 Answer Module

The Answer Module combines the information from the Memory Module and the Question Module through a fully connected layer, and then outputs, for every word in the vocabulary, how likely it is to be the answer to the question.
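
In the code below this amounts to a single linear layer over the concatenation of the final memory $m$ and the question $q$, producing one logit per vocabulary word (the softmax is applied later, inside the cross-entropy loss):

$$ a = W^{(a)} \left[ m;\; q \right] + b^{(a)} $$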


In [13]:
class AnswerModule(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super(AnswerModule, self).__init__()
        self.z = nn.Linear(2 * hidden_size, vocab_size)
        self.dropout = nn.Dropout(0.1)

    def forward(self, M, questions):
        M = self.dropout(M)
        concat = torch.cat([M, questions], dim=2).squeeze(1)
        z = self.z(concat)
        return z

2.6 Combine

Let's take another look at the overall network structure.

Note: the network implemented here is not exactly the one in this figure (the figure comes from the third of the four papers listed above), but we borrow it anyway: the overall structures of the two papers are very similar and differ only in some architectural details, and the information flow is practically identical.


In [14]:
class DMNPlus(nn.Module):
    '''
    This class combines all the modules above. The data flow is as shown in the image above.
    '''
    def __init__(self, hidden_size, vocab_size, num_hop=3, qa=None):
        super(DMNPlus, self).__init__()
        self.num_hop = num_hop
        self.qa = qa
        self.word_embedding = nn.Embedding(vocab_size, hidden_size, padding_idx=0, sparse=True).cuda()
        self.criterion = nn.CrossEntropyLoss(reduction='sum')

        self.input_module = InputModule(vocab_size, hidden_size)
        self.question_module = QuestionModule(vocab_size, hidden_size)
        self.memory = EpisodicMemory(hidden_size)
        self.answer_module = AnswerModule(vocab_size, hidden_size)

    def forward(self, contexts, questions):
        '''
        contexts.size() -> (#batch, #sentence, #token) -> (#batch, #sentence, #hidden = #embedding)
        questions.size() -> (#batch, #token) -> (#batch, 1, #hidden)
        '''
        facts = self.input_module(contexts, self.word_embedding)
        questions = self.question_module(questions, self.word_embedding)
        M = questions
        for hop in range(self.num_hop):
            M = self.memory(facts, questions, M)
        preds = self.answer_module(M, questions)
        return preds

    # turn indices back into words, similar to Part 1 (Dataset Exploration)
    def interpret_indexed_tensor(self, var):
        if len(var.size()) == 3:
            # var -> n x #sen x #token
            for n, sentences in enumerate(var):
                for i, sentence in enumerate(sentences):
                    s = ' '.join([self.qa.IVOCAB[elem.item()] for elem in sentence])
                    print(f'{n}th of batch, {i}th sentence, {s}')
        elif len(var.size()) == 2:
            # var -> n x #token
            for n, sentence in enumerate(var):
                s = ' '.join([self.qa.IVOCAB[elem.item()] for elem in sentence])
                print(f'{n}th of batch, {s}')
        elif len(var.size()) == 1:
            # var -> n (one token per batch)
            for n, token in enumerate(var):
                s = self.qa.IVOCAB[token.item()]
                print(f'{n}th of batch, {s}')

    # calculate the loss of the network
    def get_loss(self, contexts, questions, targets):
        output = self.forward(contexts, questions)
        loss = self.criterion(output, targets)
        reg_loss = 0
        for param in self.parameters():
            reg_loss += 0.001 * torch.sum(param * param)
        preds = F.softmax(output, dim=1)
        _, pred_ids = torch.max(preds, dim=1)
        corrects = (pred_ids.data == targets.data)
        acc = torch.mean(corrects.float())
        return loss + reg_loss, acc

3. Training and Test

3.1 Train the Network

The bAbI dataset contains 20 small QA tasks. To keep training simple, we train on one task at a time. In the cell below, set task_id to the last digit of your student ID; students whose ID ends in 0 should set task_id to 10.


In [26]:
task_id = 10

epochs = 3

if task_id in range(1, 21):
#def train(task_id):
    dset = BabiDataset(task_id)
    vocab_size = len(dset.QA.VOCAB)
    hidden_size = 80
    

    model = DMNPlus(hidden_size, vocab_size, num_hop=3, qa=dset.QA)
    model.cuda()
    
    best_acc = 0
    optim = torch.optim.Adam(model.parameters())

    for epoch in range(epochs):
        dset.set_mode('train')
        train_loader = DataLoader(
                dset, batch_size=32, shuffle=True, collate_fn=pad_collate
                )

        model.train()
        total_acc = 0
        cnt = 0
        for batch_idx, data in enumerate(train_loader):
            contexts, questions, answers = data
            batch_size = contexts.size()[0]
            contexts  = Variable(contexts.long().cuda())
            questions = Variable(questions.long().cuda())
            answers   = Variable(answers.cuda())

            loss, acc = model.get_loss(contexts, questions, answers)
            optim.zero_grad()
            loss.backward()
            optim.step()
            
            total_acc += acc * batch_size
            cnt += batch_size
            if batch_idx % 20 == 0:
                print(f'[Task {task_id}, Epoch {epoch}] [Training] loss : {loss.item(): {10}.{8}}, acc : {total_acc / cnt: {5}.{4}}, batch_idx : {batch_idx}')


[Task 10, Epoch 0] [Training] loss :  106.01664, acc :  0.25, batch_idx : 0
[Task 10, Epoch 0] [Training] loss :  40.160023, acc :  0.4598, batch_idx : 20
[Task 10, Epoch 0] [Training] loss :  35.166271, acc :  0.4451, batch_idx : 40
[Task 10, Epoch 0] [Training] loss :  43.961296, acc :  0.438, batch_idx : 60
[Task 10, Epoch 0] [Training] loss :  31.439713, acc :  0.4425, batch_idx : 80
[Task 10, Epoch 0] [Training] loss :   32.85585, acc :  0.4462, batch_idx : 100
[Task 10, Epoch 0] [Training] loss :  32.927139, acc :  0.445, batch_idx : 120
[Task 10, Epoch 0] [Training] loss :  37.371578, acc :  0.4441, batch_idx : 140
[Task 10, Epoch 0] [Training] loss :  31.568272, acc :  0.4513, batch_idx : 160
[Task 10, Epoch 0] [Training] loss :  30.929462, acc :  0.4577, batch_idx : 180
[Task 10, Epoch 0] [Training] loss :  29.434761, acc :  0.4622, batch_idx : 200
[Task 10, Epoch 0] [Training] loss :  31.283617, acc :  0.4596, batch_idx : 220
[Task 10, Epoch 0] [Training] loss :  30.693546, acc :  0.4593, batch_idx : 240
[Task 10, Epoch 0] [Training] loss :  32.336746, acc :  0.4594, batch_idx : 260
[Task 10, Epoch 0] [Training] loss :  30.624126, acc :  0.4597, batch_idx : 280
[Task 10, Epoch 1] [Training] loss :  32.483063, acc :   0.5, batch_idx : 0
[Task 10, Epoch 1] [Training] loss :  31.752934, acc :  0.4792, batch_idx : 20
[Task 10, Epoch 1] [Training] loss :  29.740837, acc :  0.4619, batch_idx : 40
[Task 10, Epoch 1] [Training] loss :  32.197277, acc :  0.4606, batch_idx : 60
[Task 10, Epoch 1] [Training] loss :  34.138874, acc :  0.4556, batch_idx : 80
[Task 10, Epoch 1] [Training] loss :  27.886118, acc :  0.4551, batch_idx : 100
[Task 10, Epoch 1] [Training] loss :  31.685862, acc :  0.4592, batch_idx : 120
[Task 10, Epoch 1] [Training] loss :  29.508682, acc :  0.4619, batch_idx : 140
[Task 10, Epoch 1] [Training] loss :  33.132858, acc :  0.4577, batch_idx : 160
[Task 10, Epoch 1] [Training] loss :   32.70747, acc :  0.4591, batch_idx : 180
[Task 10, Epoch 1] [Training] loss :  28.314686, acc :  0.461, batch_idx : 200
[Task 10, Epoch 1] [Training] loss :  33.346001, acc :  0.463, batch_idx : 220
[Task 10, Epoch 1] [Training] loss :  34.441978, acc :  0.463, batch_idx : 240
[Task 10, Epoch 1] [Training] loss :  33.728062, acc :  0.4661, batch_idx : 260
[Task 10, Epoch 1] [Training] loss :  29.007236, acc :  0.4683, batch_idx : 280
[Task 10, Epoch 2] [Training] loss :  30.992353, acc :  0.4688, batch_idx : 0
[Task 10, Epoch 2] [Training] loss :  28.893627, acc :  0.4747, batch_idx : 20
[Task 10, Epoch 2] [Training] loss :   29.06044, acc :  0.4779, batch_idx : 40
[Task 10, Epoch 2] [Training] loss :  30.939363, acc :  0.4821, batch_idx : 60
[Task 10, Epoch 2] [Training] loss :  34.449421, acc :  0.4753, batch_idx : 80
[Task 10, Epoch 2] [Training] loss :  30.197206, acc :  0.4712, batch_idx : 100
[Task 10, Epoch 2] [Training] loss :    30.6458, acc :  0.476, batch_idx : 120
[Task 10, Epoch 2] [Training] loss :  30.028925, acc :  0.477, batch_idx : 140
[Task 10, Epoch 2] [Training] loss :  30.042446, acc :  0.4763, batch_idx : 160
[Task 10, Epoch 2] [Training] loss :  31.846462, acc :  0.4732, batch_idx : 180
[Task 10, Epoch 2] [Training] loss :  33.384731, acc :  0.4726, batch_idx : 200
[Task 10, Epoch 2] [Training] loss :  33.844685, acc :  0.4713, batch_idx : 220
[Task 10, Epoch 2] [Training] loss :  29.480936, acc :  0.4735, batch_idx : 240
[Task 10, Epoch 2] [Training] loss :  28.608585, acc :  0.4734, batch_idx : 260
[Task 10, Epoch 2] [Training] loss :  31.549011, acc :  0.4759, batch_idx : 280

3.2 Test

What accuracy does this task reach on the test set?


In [28]:
dset.set_mode('test')
test_loader = DataLoader(
    dset, batch_size=100, shuffle=False, collate_fn=pad_collate
    )
test_acc = 0
cnt = 0

model.eval()
for batch_idx, data in enumerate(test_loader):
    contexts, questions, answers = data
    batch_size = contexts.size()[0]
    contexts = Variable(contexts.long().cuda())
    questions = Variable(questions.long().cuda())
    answers = Variable(answers.cuda())
    
    _, acc = model.get_loss(contexts, questions, answers)
    test_acc += acc * batch_size
    cnt += batch_size
print(f'Task {task_id}, Epoch {epoch}] [Test] Accuracy : {test_acc / cnt: {5}.{4}}')


Task 10, Epoch 2] [Test] Accuracy :  0.437

3.3 Show example

Let's convert the output back into text and eyeball how well the model does. First we adapt the interpretation function provided above.


In [29]:
def interpret_indexed_tensor(references, contexts, questions, answers, predictions):
    for n, data in enumerate(zip(contexts, questions, answers, predictions)):
        context  = data[0]
        question = data[1]
        answer   = data[2]
        predict  = data[3]
        
        print(n)
        for i, sentence in enumerate(context):
            s = ' '.join([references[elem.item()] for elem in sentence])
            print(f'{i}th sentence: {s}')
        q = ' '.join([references[elem.item()] for elem in question])
        print(f'question: {q}')
        a = references[answer.item()]
        print(f'answer:   {a}')
        p = references[predict.argmax().item()]
        print(f'predict:  {p}')

In [30]:
dset.set_mode('test')
dataloader = DataLoader(dset, batch_size=4, shuffle=False, collate_fn=pad_collate)
contexts, questions, answers = next(iter(dataloader))

contexts  = Variable(contexts.long().cuda())
questions = Variable(questions.long().cuda())

# prediction
model.eval()
predicts = model(contexts, questions)

references = dset.QA.IVOCAB
#contexts, questions, answers = next(iter(dataloader))
#interpret_indexed_tensor(contexts, questions, answers)
interpret_indexed_tensor(references, contexts, questions, answers, predicts)


0
0th sentence: mary is in the school . <EOS> <PAD> <PAD> <PAD> <PAD>
1th sentence: bill is in the kitchen . <EOS> <PAD> <PAD> <PAD> <PAD>
2th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
3th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
4th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
5th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
6th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
7th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
question: is bill in the bedroom <EOS>
answer:   no
predict:  no
1
0th sentence: mary is in the school . <EOS> <PAD> <PAD> <PAD> <PAD>
1th sentence: bill is in the kitchen . <EOS> <PAD> <PAD> <PAD> <PAD>
2th sentence: bill journeyed to the bedroom . <EOS> <PAD> <PAD> <PAD> <PAD>
3th sentence: fred travelled to the cinema . <EOS> <PAD> <PAD> <PAD> <PAD>
4th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
5th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
6th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
7th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
question: is bill in the bedroom <EOS>
answer:   yes
predict:  no
2
0th sentence: mary is in the school . <EOS> <PAD> <PAD> <PAD> <PAD>
1th sentence: bill is in the kitchen . <EOS> <PAD> <PAD> <PAD> <PAD>
2th sentence: bill journeyed to the bedroom . <EOS> <PAD> <PAD> <PAD> <PAD>
3th sentence: fred travelled to the cinema . <EOS> <PAD> <PAD> <PAD> <PAD>
4th sentence: fred went back to the park . <EOS> <PAD> <PAD> <PAD>
5th sentence: bill is either in the school or the office . <EOS>
6th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
7th sentence: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
question: is bill in the park <EOS>
answer:   no
predict:  no
3
0th sentence: mary is in the school . <EOS> <PAD> <PAD> <PAD> <PAD>
1th sentence: bill is in the kitchen . <EOS> <PAD> <PAD> <PAD> <PAD>
2th sentence: bill journeyed to the bedroom . <EOS> <PAD> <PAD> <PAD> <PAD>
3th sentence: fred travelled to the cinema . <EOS> <PAD> <PAD> <PAD> <PAD>
4th sentence: fred went back to the park . <EOS> <PAD> <PAD> <PAD>
5th sentence: bill is either in the school or the office . <EOS>
6th sentence: mary went to the cinema . <EOS> <PAD> <PAD> <PAD> <PAD>
7th sentence: julie is either in the school or the office . <EOS>
question: is fred in the park <EOS>
answer:   yes
predict:  no

4. Exercise

1. What is the difference between QA and retrieval in a relational database (if relational databases have slipped your mind, click here for a quick review)? Briefly describe your understanding.

Answer: Retrieval in a relational database relies on indexes built over the relevant fields and on condition expressions, whereas retrieval in QA is driven by memory state. A relational database can build an index on a field, use it to locate where the data are stored, and fetch the data. A QA system instead stores its question-answering knowledge as memory states, which can be updated during training; at query time the question is converted into features, the DMN produces an output, and the output is converted back into natural language.

2. How are the QA problems in the bAbI dataset split into 20 tasks, and on what basis? Follow the hints above, open the dataset files and have a look, then answer.

Answer: The bAbI QA problems are split according to the type of answer or the evidence the answer requires. For example, task 1 consists of answers supported by a single fact (sentence), while tasks 2 and 3 require two and three supporting facts respectively; task 10 is answered with yes/no/maybe, and task 7 asks for a count.

3. Write a few lines of code to explore how many training examples and how many test examples one task in this dataset contains (the code in the cell below is the hint).


In [35]:
# default dataset is training set
dataset = BabiDataset( task_id )
print('train set:', len(dataset))

# switch the mode to "test"; the dataset now serves the test split
dataset.set_mode("test")
print('test set:', len(dataset))


train set: 9000
test set: 1000

Answer: Task 10 has 9000 training examples and 1000 test examples.

4. Following the hints in the comments, change the bi-directional GRU in the Input Module to an ordinary GRU and re-run the network.


In [37]:
class InputModule(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super(InputModule, self).__init__()
        self.hidden_size = hidden_size
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(0.1)

    def forward(self, contexts, word_embedding):
        '''
        contexts.size() -> (#batch, #sentence, #token)
        word_embedding() -> (#batch, #sentence x #token, #embedding)
        position_encoding() -> (#batch, #sentence, #embedding)
        facts.size() -> (#batch, #sentence, #hidden = #embedding)
        '''
        batch_num, sen_num, token_num = contexts.size()

        contexts = contexts.view(batch_num, -1)
        contexts = word_embedding(contexts)
        
        contexts = contexts.view(batch_num, sen_num, token_num, -1)
        contexts = self.position_encoding(contexts)
        contexts = self.dropout(contexts)
        
        facts, hdn = self.gru(contexts)
        return facts

    def position_encoding(self, embedded_sentence):
        '''
        embedded_sentence.size() -> (#batch, #sentence, #token, #embedding)
        l.size() -> (#sentence, #embedding)
        output.size() -> (#batch, #sentence, #embedding)
        '''
        _, _, slen, elen = embedded_sentence.size()

        l = [[(1 - s/(slen-1)) - (e/(elen-1)) * (1 - 2*s/(slen-1)) for e in range(elen)] for s in range(slen)]
        l = torch.FloatTensor(l)
        l = l.unsqueeze(0) # for #batch
        l = l.unsqueeze(1) # for #sen
        l = l.expand_as(embedded_sentence)
        weighted = embedded_sentence * Variable(l.cuda())
        return torch.sum(weighted, dim=2).squeeze(2) # sum with tokens

In [38]:
task_id = 10

epochs = 3

if task_id in range(1, 21):
#def train(task_id):
    dset = BabiDataset(task_id)
    vocab_size = len(dset.QA.VOCAB)
    hidden_size = 80
    

    model = DMNPlus(hidden_size, vocab_size, num_hop=3, qa=dset.QA)
    model.cuda()
    
    best_acc = 0
    optim = torch.optim.Adam(model.parameters())

    for epoch in range(epochs):
        dset.set_mode('train')
        train_loader = DataLoader(
                dset, batch_size=32, shuffle=True, collate_fn=pad_collate
                )

        model.train()
        total_acc = 0
        cnt = 0
        for batch_idx, data in enumerate(train_loader):
            contexts, questions, answers = data
            batch_size = contexts.size()[0]
            contexts  = Variable(contexts.long().cuda())
            questions = Variable(questions.long().cuda())
            answers   = Variable(answers.cuda())

            loss, acc = model.get_loss(contexts, questions, answers)
            optim.zero_grad()
            loss.backward()
            optim.step()
            
            total_acc += acc * batch_size
            cnt += batch_size
            if batch_idx % 20 == 0:
                print(f'[Task {task_id}, Epoch {epoch}] [Training] loss : {loss.item(): {10}.{8}}, acc : {total_acc / cnt: {5}.{4}}, batch_idx : {batch_idx}')


[Task 10, Epoch 0] [Training] loss :  105.35245, acc :   0.0, batch_idx : 0
[Task 10, Epoch 0] [Training] loss :  32.426529, acc :  0.4152, batch_idx : 20
[Task 10, Epoch 0] [Training] loss :  34.493927, acc :  0.4108, batch_idx : 40
[Task 10, Epoch 0] [Training] loss :  31.886969, acc :  0.4196, batch_idx : 60
[Task 10, Epoch 0] [Training] loss :  36.208771, acc :  0.4267, batch_idx : 80
[Task 10, Epoch 0] [Training] loss :  33.786331, acc :  0.4332, batch_idx : 100
[Task 10, Epoch 0] [Training] loss :  36.083965, acc :  0.436, batch_idx : 120
[Task 10, Epoch 0] [Training] loss :  33.739494, acc :  0.4424, batch_idx : 140
[Task 10, Epoch 0] [Training] loss :  31.794558, acc :  0.4429, batch_idx : 160
[Task 10, Epoch 0] [Training] loss :  34.859913, acc :  0.4456, batch_idx : 180
[Task 10, Epoch 0] [Training] loss :   34.17836, acc :  0.4484, batch_idx : 200
[Task 10, Epoch 0] [Training] loss :  36.449268, acc :  0.4529, batch_idx : 220
[Task 10, Epoch 0] [Training] loss :  31.467968, acc :  0.4506, batch_idx : 240
[Task 10, Epoch 0] [Training] loss :  29.387077, acc :  0.4545, batch_idx : 260
[Task 10, Epoch 0] [Training] loss :  38.835068, acc :  0.4563, batch_idx : 280
[Task 10, Epoch 1] [Training] loss :  31.007389, acc :   0.5, batch_idx : 0
[Task 10, Epoch 1] [Training] loss :  33.364117, acc :  0.4747, batch_idx : 20
[Task 10, Epoch 1] [Training] loss :  28.619299, acc :  0.4802, batch_idx : 40
[Task 10, Epoch 1] [Training] loss :  34.057922, acc :  0.4641, batch_idx : 60
[Task 10, Epoch 1] [Training] loss :  32.449863, acc :  0.4668, batch_idx : 80
[Task 10, Epoch 1] [Training] loss :  29.971666, acc :  0.4706, batch_idx : 100
[Task 10, Epoch 1] [Training] loss :  31.620749, acc :  0.47, batch_idx : 120
[Task 10, Epoch 1] [Training] loss :  32.884495, acc :  0.4727, batch_idx : 140
[Task 10, Epoch 1] [Training] loss :  38.045845, acc :  0.4717, batch_idx : 160
[Task 10, Epoch 1] [Training] loss :   27.73971, acc :  0.4698, batch_idx : 180
[Task 10, Epoch 1] [Training] loss :  29.684376, acc :  0.4715, batch_idx : 200
[Task 10, Epoch 1] [Training] loss :  32.233799, acc :  0.4713, batch_idx : 220
[Task 10, Epoch 1] [Training] loss :  35.001186, acc :  0.4682, batch_idx : 240
[Task 10, Epoch 1] [Training] loss :  28.398779, acc :  0.4685, batch_idx : 260
[Task 10, Epoch 1] [Training] loss :  30.737085, acc :  0.4693, batch_idx : 280
[Task 10, Epoch 2] [Training] loss :  33.349392, acc :  0.3438, batch_idx : 0
[Task 10, Epoch 2] [Training] loss :  27.977625, acc :  0.4881, batch_idx : 20
[Task 10, Epoch 2] [Training] loss :  27.708397, acc :  0.4817, batch_idx : 40
[Task 10, Epoch 2] [Training] loss :  36.034698, acc :  0.4693, batch_idx : 60
[Task 10, Epoch 2] [Training] loss :  33.423553, acc :  0.4688, batch_idx : 80
[Task 10, Epoch 2] [Training] loss :  30.528635, acc :  0.4731, batch_idx : 100
[Task 10, Epoch 2] [Training] loss :   28.80201, acc :  0.4757, batch_idx : 120
[Task 10, Epoch 2] [Training] loss :  29.157539, acc :  0.4803, batch_idx : 140
[Task 10, Epoch 2] [Training] loss :  31.765087, acc :  0.4845, batch_idx : 160
[Task 10, Epoch 2] [Training] loss :  31.502867, acc :  0.4852, batch_idx : 180
[Task 10, Epoch 2] [Training] loss :  29.833849, acc :  0.4893, batch_idx : 200
[Task 10, Epoch 2] [Training] loss :  26.446497, acc :  0.4914, batch_idx : 220
[Task 10, Epoch 2] [Training] loss :  31.450598, acc :  0.4944, batch_idx : 240
[Task 10, Epoch 2] [Training] loss :  31.787931, acc :  0.4965, batch_idx : 260
[Task 10, Epoch 2] [Training] loss :  29.504681, acc :  0.5044, batch_idx : 280

5. Based on the code and your understanding, briefly describe in your own words what the Episodic Memory does.

Answer: The Episodic Memory consists of an Attention Mechanism and a Memory Update Mechanism, and it takes the outputs of the Input Module and the Question Module as inputs. It first passes the sentence representations (facts) together with the question representation (questions) to the Attention Mechanism, which yields a relevance relation between the sentences and the question. These attention scores and the facts are then fed into the Attention GRU, which produces a context vector. Finally, the Memory Update Mechanism uses the context vector to update the memory held inside the Episodic Memory, and the updated memory is returned as the output.

In the DMN, the parameter num_hop specifies how many times the Episodic Memory is iterated.